
@MadLittleMods MadLittleMods commented Aug 7, 2025

Re-introduces #18751, which was reverted in #18789 (#18789 explains why the PR was reverted in the first place).


Part of #18592

Testing strategy

  1. Run the full test suite: `SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests`
  2. Run the full test suite with Postgres: `SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests`
  3. Ensure the suite finishes without ballooning memory and without metric tests like `test_exposed_to_prometheus` taking a long time.

Dev notes

```
SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.util.test_task_scheduler.TestTaskSchedulerWithBackgroundWorker.test_schedule_task
SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.replication.tcp.streams.test_federation.FederationStreamTestCase.test_catchup
```

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

Fix `LaterGauge` metrics to collect from all servers

Follow-up to #18714

Previously, our `LaterGauge` metrics included the `server_name` label
as expected, but in some cases only the last server was being reported:
any `LaterGauge` that we created multiple times only reported the last
instance.

This PR updates every `LaterGauge` to be created exactly once; each server
then uses `LaterGauge.register_hook(...)` to add its metric callback as before.
This works now because we store a list of callbacks instead of just one.
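
As a rough illustration, here is a minimal sketch of the hook-registration pattern, built on `prometheus_client`'s custom-collector API. Everything beyond the `register_hook` / `unregister_hooks_for_homeserver_instance_id` names (which appear in this PR) is an assumption for the sketch, not the actual Synapse implementation:

```python
# Minimal sketch of the hook-list pattern (assumptions marked; not the actual
# Synapse implementation).
from typing import Callable, Dict, Iterable, List, Tuple

from prometheus_client import REGISTRY
from prometheus_client.core import GaugeMetricFamily, Metric

# Hook return shape is an assumption: label values -> gauge value.
MetricsEntry = Dict[Tuple[str, ...], float]


class LaterGaugeSketch:
    def __init__(self, name: str, desc: str, labelnames: List[str]) -> None:
        self.name = name
        self.desc = desc
        self.labelnames = labelnames
        # One hook per homeserver instance, so several servers in the same
        # process each contribute samples instead of clobbering each other.
        self._hooks: Dict[str, Callable[[], MetricsEntry]] = {}
        REGISTRY.register(self)

    def register_hook(
        self, *, homeserver_instance_id: str, hook: Callable[[], MetricsEntry]
    ) -> None:
        # Mirrors the assertion visible in the traceback further down:
        # registering twice for the same instance is treated as a bug.
        assert homeserver_instance_id not in self._hooks, (
            f"{self.name} hook already registered for {homeserver_instance_id}"
        )
        self._hooks[homeserver_instance_id] = hook

    def unregister_hooks_for_homeserver_instance_id(self, instance_id: str) -> None:
        self._hooks.pop(instance_id, None)

    def collect(self) -> Iterable[Metric]:
        # Merge samples from every registered hook into one metric family.
        family = GaugeMetricFamily(self.name, self.desc, labels=self.labelnames)
        for hook in self._hooks.values():
            for label_values, value in hook().items():
                family.add_metric(list(label_values), value)
        yield family
```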

I noticed this problem thanks to some [tests in the Synapse Pro for
Small Hosts](element-hq/synapse-small-hosts#173)
repo that sanity-check all metrics to ensure each metric includes data
from multiple servers.


### Testing strategy

1. This is only noticeable when you run multiple Synapse instances in
the same process.
   1. TODO

(see the test that was added; a sketch of the kind of check it performs follows)
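
A hedged sketch of the kind of sanity check such a test can perform; the helper name and the use of the default registry are assumptions:

```python
# Sketch: assert that a gauge reports samples for every homeserver in the
# process, not just the last one registered (helper name is hypothetical).
from prometheus_client import REGISTRY


def assert_gauge_reports_all_servers(metric_name: str, server_names: set) -> None:
    # Collect current samples and gather every server_name label seen.
    seen = {
        sample.labels.get("server_name")
        for metric in REGISTRY.collect()
        if metric.name == metric_name
        for sample in metric.samples
    }
    missing = server_names - seen
    assert not missing, f"{metric_name} is missing samples for {missing}"
```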

### Dev notes

`LaterGauge` metrics that were previously non-global:

```
synapse_federation_send_queue_xxx
synapse_federation_transaction_queue_pending_destinations
synapse_federation_transaction_queue_pending_pdus
synapse_federation_transaction_queue_pending_edus
synapse_handlers_presence_user_to_current_state_size
synapse_handlers_presence_wheel_timer_size
synapse_notifier_listeners
synapse_notifier_rooms
synapse_notifier_users
synapse_replication_tcp_resource_total_connections
synapse_replication_tcp_command_queue
synapse_background_update_status
synapse_federation_known_servers
synapse_scheduler_running_tasks
```



We use `instance_id` instead of `server_name` because it's possible to have multiple
workers running in the same process with the same `server_name`.
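
As a toy illustration (identifiers and values here are made up):

```python
# Two workers can share a server_name, so keying hooks by server_name alone
# would collide; the per-process instance_id stays unique.
hooks: dict = {}


def register_hook(instance_id: str, server_name: str) -> None:
    assert instance_id not in hooks, f"hook already registered for {instance_id}"
    hooks[instance_id] = lambda: {(server_name,): 1.0}


register_hook("abc123", "hs1.example.com")
register_hook("def456", "hs1.example.com")  # same server_name, different worker
```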
Multiple `ReplicationCommandHandler` instances were getting created for the same HS,
registering the hook twice:

```
2025-08-07 18:38:14-0500 [-] synapse.replication.tcp.handler - 386 - ERROR - process-replication-data-11 - Failed to handle command <synapse.replication.tcp.commands.PositionCommand object at 0x7fb5fbe04320>
	Traceback (most recent call last):
	  File "synapse/replication/tcp/handler.py", line 384, in _unsafe_process_queue
	    await self._process_command(cmd, conn, stream_name)
	  File "synapse/replication/tcp/handler.py", line 397, in _process_command
	    await self._process_position(stream_name, conn, cmd)
	  File "synapse/replication/tcp/handler.py", line 721, in _process_position
	    (updates, current_token, missing_updates) = await stream.get_updates_since(
	                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	        cmd.instance_name, current_token, cmd.new_token
	        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	    )
	    ^
	  File "synapse/replication/tcp/streams/_base.py", line 213, in get_updates_since
	    updates, upto_token, limited = await self.update_function(
	                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
	    ...<4 lines>...
	    )
	    ^
	  File "synapse/replication/tcp/streams/_base.py", line 269, in update_function
	    result = await client(
	             ^^^^^^^^^^^^^
	    ...<4 lines>...
	    )
	    ^
	  File "synapse/logging/opentracing.py", line 929, in _wrapper
	    return await func(*args, **kwargs)
	           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "synapse/replication/http/_base.py", line 232, in send_request
	    streams = hs.get_replication_command_handler().get_streams_to_replicate()
	              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
	  File "synapse/server.py", line 232, in _get
	    dep = builder(self)
	  File "synapse/server.py", line 756, in get_replication_command_handler
	    return ReplicationCommandHandler(self)
	  File "synapse/replication/tcp/handler.py", line 258, in __init__
	    tcp_resource_total_connections_gauge.register_hook(
	    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
	        homeserver_instance_id=hs.get_instance_id(),
	        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	        hook=lambda: {(self.server_name,): len(self._connections)},
	        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	    )
	    ^
	  File "synapse/metrics/__init__.py", line 225, in register_hook
	    assert existing_hook is None, (
	           ^^^^^^^^^^^^^^^^^^^^^
	AssertionError: LaterGauge(name=synapse_replication_tcp_resource_total_connections) hook already registered for homeserver_instance_id=gZBmA. This is likely a Synapse bug and you forgot to unregister the previous hooks for the server (especially in tests).
```
See #18791 (comment)

Comment on lines +154 to +161:

```python
# Track the background update status for each database
background_update_status.register_hook(
    homeserver_instance_id=hs.get_instance_id(),
    hook=lambda: {
        (database.name(), server_name): database.updates.get_status()
        for database in self.databases
    },
)
```
@MadLittleMods (Contributor, Author) commented:

Moved this out of the `DatabasePool` class because we create multiple `DatabasePool` instances when multiple databases are specified in the config. Previously, this would cause us to call `register_hook` multiple times for the same `server_name`, which we assert against to avoid the dumb mistake of not using the `hs.get_xxx` singletons. See #18791 (comment) for more background and other possible alternatives.

This also fixes a long-standing problem where we would only track metrics for the last database listed, because `LaterGauge` previously just clobbered any previous gauge (notice the `database_name` label that we've added now).
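
For example, a scrape might now expose one sample per database; the label and status values here are hypothetical:

```
synapse_background_update_status{database_name="main",server_name="hs1.example.com"} 2.0
synapse_background_update_status{database_name="state",server_name="hs1.example.com"} 2.0
```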

Comment on lines 375 to 390:

```python
def cleanup(self) -> None:
    """
    WIP: Clean-up any references to the homeserver and stop any running related
    processes, timers, loops, replication stream, etc.
    """
    logger.info("Received cleanup request for %s.", self.hostname)

    # TODO: Stop background processes, timers, loops, replication stream, etc.

    # Cleanup metrics associated with the homeserver
    for later_gauge in all_later_gauges_to_clean_up_on_shutdown.values():
        later_gauge.unregister_hooks_for_homeserver_instance_id(
            self.get_instance_id()
        )

    logger.info("Cleanup complete for %s.", self.hostname)
```
@MadLittleMods (Contributor, Author) commented:

The major new piece is this part that cleans up metrics from each homeserver in the tests.

Previously, the list of hooks built up until our CI machines couldn't operate properly; see #18789.
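
A hedged sketch of how a test harness might hook this up; the `addCleanup` wiring is an assumption about the test plumbing, not code from this PR:

```python
# Sketch: unregister each homeserver's metric hooks at test teardown so hooks
# don't accumulate across the suite (see #18789).
def track_homeserver_for_cleanup(testcase, hs) -> None:
    testcase.addCleanup(hs.cleanup)
```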

@MadLittleMods MadLittleMods marked this pull request as ready for review August 8, 2025 06:04
@MadLittleMods MadLittleMods requested a review from a team as a code owner August 8, 2025 06:04
@devonh (Member) left a comment:

These changes look like they address the memory blowup issue that the previous iteration of this PR had.
Do you think we should have another set of eyes go over the changes here?

The changelog entry:

```
@@ -0,0 +1 @@
Fix `LaterGauge` metrics to collect from all servers.
```
@MadLittleMods (Contributor, Author) replied:

> Do you think we should have another set of eyes go over the changes here?
>
> -- @devonh, #18791 (review)

Sounds like a good idea, given the new cleanup pattern and that you were the same person who reviewed the original PR.

@MadLittleMods MadLittleMods requested a review from a team August 12, 2025 15:22
I think this was accidentally removed before.

Testing strategy:
```
poetry run generate_workers_map --config-path homeserver.yaml
```
@MadLittleMods MadLittleMods merged commit bff4a11 into develop Sep 2, 2025
44 checks passed
@MadLittleMods MadLittleMods deleted the madlittlemods/re-introduce-18751 branch September 2, 2025 17:14
@MadLittleMods (Contributor, Author):

Thanks for the review @devonh and @erikjohnston 🐊

itsoyou pushed a commit to famedly/synapse that referenced this pull request Oct 13, 2025
…18791)

Re-introduce: element-hq/synapse#18751 that was
reverted in element-hq/synapse#18789 (explains
why the PR was reverted in the first place).

- Adds a `cleanup` pattern that cleans up metrics from each homeserver
in the tests. Previously, the list of hooks built up until our CI
machines couldn't operate properly, see
element-hq/synapse#18789
- Fix long-standing issue with `synapse_background_update_status`
metrics only tracking the last database listed in the config (see
element-hq/synapse#18791 (comment))